fix p2phttp worker thread leak with deleted repository LOCKCONTENT
authorJoey Hess <joeyh@joeyh.name>
Mon, 15 Sep 2025 15:59:37 +0000 (11:59 -0400)
committerJoey Hess <joeyh@joeyh.name>
Mon, 15 Sep 2025 16:03:44 +0000 (12:03 -0400)
p2phttp: Fix a hang that could occur when used with --directory, and a
repository in the directory got removed.

It could leak up to the -J number of worker threads, but this only affected
clients trying to access the deleted repository.

It may be that this could also affect a non-deleted repository, and also
leak a worker thread, if an invalid p2p protocol message is sent.

CHANGELOG
P2P/Http/Server.hs
doc/todo/p2phttp_serve_multiple_repositories.mdwn
doc/todo/p2phttp_serve_multiple_repositories/comment_3_429520e5411c5785b63598ffee7dbb95._comment [new file with mode: 0644]

index cb3a0e0c153152777e9baf98d5756ac67e761aea..371f53f30f285e74722f189f800338d4d313cdf1 100644 (file)
--- a/CHANGELOG
+++ b/CHANGELOG
@@ -13,6 +13,8 @@ git-annex (10.20250829) UNRELEASED; urgency=medium
   * Improve performance when used with a local git remote that has a
     large working tree.
   * Removed support for building with cryptonite, use crypton.
+  * p2phttp: Fix a hang that could occur when used with --directory,
+    and a repository in the directory got removed.
 
  -- Joey Hess <id@joeyh.name>  Fri, 29 Aug 2025 12:34:06 -0400
 
index 6e3d53030330925fc58a9585f9aee6ce8af3bdc1..88e6fa336795a1f8ed196dffe6e9eabad5ed105c 100644 (file)
@@ -477,14 +477,19 @@ serveLockContent mst su apiver (B64Key k) cu bypass sec auth = do
        let lock = do
                lockresv <- newEmptyTMVarIO
                unlockv <- newEmptyTMVarIO
+               -- A single worker thread takes the lock, and keeps running
+               -- until unlock in order to keep the lock held.
                annexworker <- async $ inAnnexWorker st $ do
                        lockres <- runFullProto (clientRunState conn) (clientP2PConnection conn) $ do
                                net $ sendMessage (LOCKCONTENT k)
                                checkSuccess
                        liftIO $ atomically $ putTMVar lockresv lockres
-                       liftIO $ atomically $ takeTMVar unlockv
-                       void $ runFullProto (clientRunState conn) (clientP2PConnection conn) $ do
-                               net $ sendMessage UNLOCKCONTENT
+                       case lockres of
+                               Right True -> do
+                                       liftIO $ atomically $ takeTMVar unlockv
+                                       void $ runFullProto (clientRunState conn) (clientP2PConnection conn) $ do
+                                               net $ sendMessage UNLOCKCONTENT
+                               _ -> return ()
                atomically (takeTMVar lockresv) >>= \case
                        Right True -> return (Just (annexworker, unlockv))
                        _ -> return Nothing
index f2ad9c752e34539d132ab0770cadaea79e548b04..47cf8ad8fdfaa246867f491807f1764c3f1e3ef2 100644 (file)
@@ -19,3 +19,5 @@ I asked matrss if this would be useful for forgejo-aneksajo and he said
 very useful, although I think I can work with the limitation [of only 1]."
 
 [[!tag projects/INM7]]
+
+> [[done]] --[[Joey]]
diff --git a/doc/todo/p2phttp_serve_multiple_repositories/comment_3_429520e5411c5785b63598ffee7dbb95._comment b/doc/todo/p2phttp_serve_multiple_repositories/comment_3_429520e5411c5785b63598ffee7dbb95._comment
new file mode 100644 (file)
index 0000000..c3dee29
--- /dev/null
@@ -0,0 +1,16 @@
+[[!comment format=mdwn
+ username="joey"
+ subject="""comment 3"""
+ date="2025-09-15T15:18:15Z"
+ content="""
+Seems the bug is specific to LOCKCONTENT. When doing other operations,
+like CHECKPRESENT after the repo is deleted, the server returns
+FAILURE and continues being able to serve more requests for that repo.
+
+Ah, the problem is that serveLockContent runs a block of actions in a
+single inAnnexWorker call, which first sends the LOCKCONTENT, then
+blocks waiting for the unlock to arrive. That never happens, so it remains
+blocked there forever, consuming a worker thread.
+
+Fixed that, finally.
+"""]]
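
The blocked-worker pattern the comment above describes can be sketched standalone. This is a hypothetical reduction, not the real serveLockContent: `lockContent` and `unlockv` stand in for the P2P protocol lock call and the unlock TMVar. The point of the fix is that the worker only blocks on the unlock TMVar when taking the lock succeeded; on failure (e.g. the repository was deleted) it returns at once instead of hanging forever.

```haskell
import Control.Concurrent.Async (async, wait)
import Control.Concurrent.STM

-- Hypothetical stand-in for the worker body: take the lock, and only
-- when that succeeds, stay alive until the client signals unlock.
workerLock :: IO Bool -> TMVar () -> IO ()
workerLock lockContent unlockv = do
	ok <- lockContent
	if ok
		then do
			-- Lock held: block here to keep it held,
			-- until the unlock request arrives.
			atomically $ takeTMVar unlockv
			putStrLn "UNLOCKCONTENT"
		else
			-- Lock failed (repository deleted): return
			-- immediately rather than blocking on a TMVar
			-- that will never be filled, which is what
			-- leaked the worker thread before the fix.
			return ()

main :: IO ()
main = do
	unlockv <- newEmptyTMVarIO
	-- Simulate a failed lock: the worker finishes promptly.
	w <- async $ workerLock (return False) unlockv
	wait w
	putStrLn "worker returned"
```

Before the fix, the `else` branch did not exist: the worker always did `takeTMVar unlockv`, so a failed lock left it blocked forever and consumed one of the -J worker slots.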